-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Tablet Refresh Behavior in VReplication Traffic Switch Handling #10058
Conversation
c303c69
to
7eaec39
Compare
7eaec39
to
2f1f9ff
Compare
6a1489b
to
eb75634
Compare
e7d1b54
to
67b1639
Compare
And refresh all tablets in dry runs to alert the user to potential issues. Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
1e9f82f
to
11fb7ca
Compare
This keeps the previous behavior of ignoring partial refresh related errors, while providing the defails of WHY we had a partial refresh for anyone that's interested. Signed-off-by: Matt Lord <[email protected]>
11fb7ca
to
18f9573
Compare
I ended up changing the function signature for
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -121,7 +137,7 @@ func UpdateShardRecords( | |||
// For 'to' shards, refresh to make them serve. The 'from' shards will | |||
// be refreshed after traffic has migrated. | |||
if !isFrom { | |||
if _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil { | |||
if _, _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called from a number of places via Wrangler.updateShardRecords()
. I don't think that we can change the partial handling here w/o other changes.
This way the caller has that info and can decide what to do with it (if anything). Signed-off-by: Matt Lord <[email protected]>
@@ -354,7 +354,7 @@ func (wr *Wrangler) cancelHorizontalResharding(ctx context.Context, keyspace, sh | |||
|
|||
destinationShards[i] = updatedShard | |||
|
|||
if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil { | |||
if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't make sense to return an error when canceling.
@@ -442,7 +442,7 @@ func (wr *Wrangler) MigrateServedTypes(ctx context.Context, keyspace, shard stri | |||
refreshShards = destinationShards | |||
} | |||
for _, si := range refreshShards { | |||
_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger()) | |||
_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is past the point of no return.
@@ -792,7 +792,7 @@ func (wr *Wrangler) masterMigrateServedType(ctx context.Context, keyspace string | |||
} | |||
|
|||
for _, si := range destinationShards { | |||
if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil { | |||
if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is past the point of no return.
@@ -1226,7 +1226,7 @@ func (wr *Wrangler) replicaMigrateServedFrom(ctx context.Context, ki *topo.Keysp | |||
|
|||
// Now refresh the source servers so they reload the denylist | |||
event.DispatchUpdate(ev, "refreshing sources tablets state so they update their denied tables") | |||
_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger()) | |||
_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is past the point of no return.
_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger()) | ||
rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout) | ||
defer cancel() | ||
_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called when canceling.
go/vt/wrangler/traffic_switcher.go
Outdated
_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger()) | ||
rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout) | ||
defer cancel() | ||
_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a safe/good place to return an error if we could not refresh all tablets as its early on in the operation. I'll work on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 90f656f
_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger()) | ||
rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout) | ||
defer cancel() | ||
_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is after the point of no return.
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! lgtm
Description
Because we refresh the tablets in source and target shards involved in a vreplication workflow during the operation (e.g.
SwitchWrites
) — in order to update the serving vschema and query execution rules — let's also do that for the dry run so that we can provide a warning for users that the operation could fail. Example output:The
topotools.RefreshTabletsByShard
call is best-effort and lets the caller know if the results were not complete (partial). We use a lower timeout within the traffic switcher when refreshing the state for each tablet (done in a goroutine) as:We also add tablet refresh to the traffic switcher's pre-check call to ensure that we have healthy tablets on the source and target shards before we begin the operation as otherwise the operation is not safe and can fail. Example output:
All of these things help to make vreplication workflows more predictable, more reliable, and incur less downtime.
Related Issue(s)
Checklist